Table of Contents | DIM Overview (Volume 5, Issue 3)
Data and Information Management
ISSN 2543-9251
Volume 5, Issue 3
https://content.sciendo.com/view/journals/dim/dim-overview.xml
Editorial
Knowledge Entity Extraction and Text Mining in the Era of Big Data
Authors: Chengzhi Zhang, Philipp Mayr, Wei Lu and Yi Zhang
Research Article
Discovering Booming Bio-entities and Their Relationship with Funds
Authors: Fang Tan, Tongyang Zhang, Siting Yang, Xiaoyan Wu and Jian Xu
Abstract: With increasing pressure on the National Institutes of Health (NIH) budget, cutting waste and improving efficiency in research funding allocation is a major challenge. To meet this challenge, this paper explores research hotspots and disciplinary trends in the biomedical area and discusses their relationship with government funding, thereby uncovering the biomedical hotspots of interest to academia and the evolution of U.S. federal government funding through an entitymetrics analysis. Because the rapid proliferation of biomedical literature provides large amounts of information for knowledge discovery, entities extracted from PubMed articles and NIH-funded projects during 1988–2017 are taken as experimental data. They are divided into four categories: species, diseases, genes, and drugs. A comparative analysis of entity trajectories in the four domains is then performed, which includes occurrence frequency calculations of disease entities to explore frequency trends in high-frequency entities and the distribution of research funds. Finally, we conduct an evolutionary analysis of two relationships: between research popularity and the amount of funding, and between research popularity and the number of funded projects. The results suggest that research on gene and disease entities is at a stage of rapid development. Diseases with high prevalence and mortality and diseases associated with genetic factors will be the focus of future research. The distribution of NIH grants shows a pronounced long-tail effect and can influence overall trends in the popularity of research topics. We also find a strong linear correlation between the research popularity of bio-entities and both the amount and the number of funding grants. However, the impact of the amount and number of grants on entity research popularity is decreasing. These results indicate the broad applicability of entitymetrics in funding research.
Keywords: entity metrics; research funds; biomedicine; evolutionary analysis
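The linear correlation mentioned in the abstract above could be measured with a Pearson coefficient. The following is a minimal sketch under that assumption; the abstract does not say which statistic the authors used, and the function name `pearson_r` and the sample series are made-up illustrations, not data from the paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Toy values only (NOT from the paper): yearly mention counts of one
# entity and the funding amount attributed to it in the same years.
popularity = [120, 150, 180, 240, 310]
funding = [1.0, 1.3, 1.5, 2.1, 2.6]
r = pearson_r(popularity, funding)
```

A value of `r` close to 1 would indicate a strong positive linear relationship between the two series.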
Research Article
A Pattern and POS Auto-Learning Method for Terminology Extraction from Scientific Text
Authors: Wei Shao, Bolin Hua and Linqi Song
Abstract: Many new scientific documents are published on various platforms every day, and it is increasingly important to discover new words and meanings in them quickly and efficiently. However, most related work relies on labeled data, which makes it difficult to handle unlabeled new documents efficiently. To address this, we introduce an unsupervised method based on sentence patterns and part-of-speech (POS) sequences. Our method needs only a few initial learnable patterns to obtain the initial terminology tokens and their POS sequences. In this process, new patterns are constructed that match more sentences and yield more POS sequences of terminology. Finally, we use the obtained POS sequences and sentence patterns to extract terms from new scientific text. Experiments on paper abstracts from Web of Knowledge show that the method is practical and achieves good performance on our test data.
Keywords: auto-learning; terminology extraction; unsupervised method; scientific text
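The pattern-and-POS idea described in the abstract above can be illustrated with a small sketch. It covers only two of the steps (seed patterns yield POS sequences; POS sequences extract terms from new text) and omits the construction of new patterns. The cue words, the tag set, the function names, and the hand-tagged sentences are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Set, Tuple

Tagged = List[Tuple[str, str]]  # (token, POS tag) pairs, hand-tagged here

# Hypothetical seed pattern: a term directly follows one of these cue words.
SEED_CUES = {"called", "named"}
# Assumed set of tags that may appear inside a term.
TERM_TAGS = {"JJ", "NN", "NNS", "NNP"}


def learn_pos_sequences(sentences: List[Tagged]) -> Set[Tuple[str, ...]]:
    """Collect the POS sequences of token runs that follow a seed cue."""
    learned = set()
    for sent in sentences:
        for i, (token, _) in enumerate(sent):
            if token.lower() not in SEED_CUES:
                continue
            j = i + 1
            while j < len(sent) and sent[j][1] in TERM_TAGS:
                j += 1
            if j > i + 1:  # at least one term token was found
                learned.add(tuple(tag for _, tag in sent[i + 1:j]))
    return learned


def extract_terms(sentence: Tagged, pos_sequences: Set[Tuple[str, ...]]) -> List[str]:
    """Return token spans whose POS tags match a learned sequence."""
    tags = [tag for _, tag in sentence]
    terms = []
    for seq in sorted(pos_sequences, key=len, reverse=True):
        n = len(seq)
        for start in range(len(sentence) - n + 1):
            if tuple(tags[start:start + n]) == seq:
                terms.append(" ".join(tok for tok, _ in sentence[start:start + n]))
    return terms


seed_corpus = [[("a", "DT"), ("method", "NN"), ("called", "VBN"),
                ("conditional", "JJ"), ("random", "JJ"), ("field", "NN")]]
new_sentence = [("we", "PRP"), ("use", "VBP"), ("deep", "JJ"),
                ("neural", "JJ"), ("network", "NN"), ("models", "NNS")]

sequences = learn_pos_sequences(seed_corpus)
terms = extract_terms(new_sentence, sequences)
```

In this toy run the seed sentence contributes the sequence `JJ JJ NN`, which then matches one span in the new sentence. A fuller version would also turn the contexts of newly found terms into additional patterns, as the abstract describes.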
Research Article
Automatic Subject Classification of Public Messages in E-government Affairs
Authors: Pei Pan and Yijin Chen
Abstract: Public messages on Internet political inquiry platforms are classified manually, which entails a heavy workload, low efficiency, and a high error rate. This paper proposes a bi-directional long short-term memory (Bi-LSTM) network model with an attention mechanism to classify public messages automatically. Using the network political inquiry data set provided by the BdRace platform as samples, the Bi-LSTM algorithm strengthens the correlation between preceding and following message text during training, and the attention mechanism strengthens semantic attention to important text features. Feature weights are integrated through a fully connected layer to carry out the classification. Experimental results show that the F1 value of the proposed model reaches 0.886 and 0.862 on the long-text and short-text data sets, respectively. Compared with three algorithms (long short-term memory (LSTM), logistic regression, and naive Bayes), the Bi-LSTM model achieves better results in the automatic classification of public message subjects.
Keywords: Internet politics inquiry; public message; subject classification; Bi-LSTM model; attention mechanism
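The attention step mentioned in the abstract above can be sketched in plain Python. This assumes dot-product scoring against a single query vector followed by a softmax; the abstract does not state the scoring function actually used, and `attention_pool`, `softmax`, and the example vectors are hypothetical. In a Bi-LSTM, each hidden state would be the concatenation of the forward and backward states for one token.

```python
from math import exp
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states: List[List[float]],
                   query: List[float]) -> Tuple[List[float], List[float]]:
    """Weight each hidden state by its dot product with `query`.

    Returns the attention weights and the weighted sum (context vector),
    which a fully connected layer could then map to class scores.
    """
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(query)
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, context


# Toy hidden states for a three-token message (values are made up).
states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = attention_pool(states, query=[2.0, 0.0])
```

With this query the first state receives the largest weight, so the context vector leans toward it; in a trained model the query would be a learned parameter rather than a fixed vector.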